214 research outputs found

Recommended from our members

Automatic Identification of Errors in Arabic Handwriting Recognition

Author: Habash Nizar Y.
Roth Ryan M.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2010

Arabic handwriting recognition (HR) is a challenging problem due to Arabic's connected letter forms, consonantal diacritics and rich morphology. In this paper we isolate the task of identifying erroneous words in HR from the task of producing corrections for these words. We consider a variety of linguistic (morphological and syntactic) and non-linguistic features to automatically identify these errors. We also consider a learning curve varying in two dimensions: number of segments and number of n-best hypotheses to train on. We additionally evaluate the performance on different test sets with different degrees of errors in them. Our best approach achieves a roughly 20% absolute increase in F-score over a simple but reasonable baseline. A detailed error analysis shows that linguistic features, such as lemma models, help improve HR-error detection precisely where we expect them to: semantically inconsistent error words.

Columbia University Academic Commons

CATiB: The Columbia Arabic Treebank

Author: Habash Nizar Y.
Roth Ryan M.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2009

The Columbia Arabic Treebank (CATiB) is a resource for Arabic parsing. CATiB contrasts with previous efforts on Arabic treebanking and treebanking of morphologically rich languages in that it encodes less linguistic information in the interest of speedier annotation of large amounts of text. This paper describes CATiB's representation and annotation procedure, and reports on achieved inter-annotator agreement and annotation speed.

Columbia University Academic Commons

Four Techniques for Online Handling of Out-of-Vocabulary Words in Arabic-English Statistical Machine Translation

Author: Habash Nizar Y.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2008

We present four techniques for online handling of Out-of-Vocabulary words in Phrase-based Statistical Machine Translation. The techniques use spelling expansion, morphological expansion, dictionary term expansion and proper name transliteration to reuse or extend a phrase table. We compare the performance of these techniques and combine them. Our results show a consistent improvement over a state-of-the-art baseline in terms of BLEU and a manual error analysis.

CiteSeerX

Columbia University Academic Commons

LDC Arabic Treebanks and Associated Corpora: Data Divisions Manual

Author: Diab Mona
Habash Nizar
Rambow Owen
Roth Ryan
Publication date: 01/01/2013

The Linguistic Data Consortium (LDC) has developed hundreds of data corpora for natural language processing (NLP) research. Among these are a number of annotated treebank corpora for Arabic. Typically, these corpora consist of a single collection of annotated documents. NLP research, however, usually requires multiple data sets for the purposes of training models, developing techniques, and final evaluation. Therefore it becomes necessary to divide the corpora used into the required data sets (divisions). This document details a set of rules that have been defined to enable consistent divisions for old and new Arabic treebanks (ATB) and related corpora. Comment: 14 pages; one cove

arXiv.org e-Print Archive

Columbia University Academic Commons

Arabic Preprocessing Schemes for Statistical Machine Translation

Author: Habash Nizar Y.
Sadat Fatiha
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2006

In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.

CiteSeerX

Crossref

NRC Publications Archive

Columbia University Academic Commons

Dialectal Arabic to English Machine Translation: Pivoting through Modern Standard Arabic

Author: Nizar Habash
Wael Salloum
Publication date: 01/01/2013

Modern Standard Arabic (MSA) has a wealth of natural language processing (NLP) tools and resources. In comparison, resources for dialectal Arabic (DA), the unstandardized spoken varieties of Arabic, are still lacking. We present ELISSA, a machine translation (MT) system for DA to MSA. ELISSA employs a rule-based approach that relies on morphological analysis, transfer rules and dictionaries in addition to language models to produce MSA paraphrases of DA sentences. ELISSA can be employed as a general preprocessor for DA when using MSA NLP tools. A manual error analysis of ELISSA's output shows that it produces correct MSA translations over 93% of the time. Using ELISSA to produce MSA versions of DA sentences as part of an MSA-pivoting DA-to-English MT solution improves BLEU scores on multiple blind test sets by between 0.6% and 1.4%.

CiteSeerX